class: title-slide, left, bottom

# Feedforward Neural Networks as Statistical Models
----
## **Andrew McInerney**, **Kevin Burke**
### University of Limerick
#### RSS Northern Ireland, 26 Oct 2022

???

Intro. I am funded by SFI CRT in FDS. This is a multi-institutional collaboration between UL, UCD and MU. The goal of the centre is to fuse and blend the fundamentals of applied mathematics, machine learning, and statistics. My research focuses on the combination of the latter two.

So, neural networks are typically implemented as black-box models in machine learning, but taking a statistical perspective, I want to show how these models have similarities to the models traditionally used in statistical modelling.

---

# Agenda

--

- Feedforward Neural Networks

???

Brief intro. to the history of NNs and the early work by statisticians in the field. Then I will introduce the FNN model.

--

- Statistical Perspective

--

- Model Selection

--

- Statistical Interpretation

--

- R Implementation

--

<br>

Slides: [bit.ly/rss-fnn-stat](https://bit.ly/rss-fnn-stat)

Code: [bit.ly/rss-fnn-stat-code](https://bit.ly/rss-fnn-stat-code)

---
class: inverse middle center subsection

# Feedforward Neural Networks

---

# Background

--

Neural networks originated from attempts to model the human brain.

<br>

--

Early influential papers:

--

- McCulloch and Pitts (1943)

--

- Rosenblatt (1958)

--

- Rumelhart, Hinton and Williams (1986)

---

# Background

Interest within the statistics community in the late 1980s and early 1990s.

--

<br>

Comprehensive reviews provided by White (1989), Ripley (1993), Cheng and Titterington (1994).

--

<br>

However, the majority of research took place outside the field of statistics (Breiman, 2001; Hooker and Mentch, 2021).

???

In the late 1980s to early 1990s, there was an interest in neural networks within the statistics community. Some really comprehensive reviews of neural networks from a statistical perspective are given by White ..., Ripley, ..., and Cheng and Titterington ...
However, since then, neural network research has been primarily conducted outside the field of statistics, first by computer scientists and now by machine learning researchers. But there have been calls to try to align this research with statistics, and the potential benefit is twofold. NNs are more flexible than traditional nonlinear regression models. Also, a statistical-modelling perspective can help improve the implementation and, more importantly, the interpretation of these models. Examples of this can be found in Rügamer and ..., which draw on NNs to increase the flexibility of distributional regression and mixed-effects modelling, and in Agarwal et al., who draw on the additive structure of GAMs to improve the interpretability of NNs, in what they term NAMs.

---

# Background

Renewed interest in merging statistical models and neural networks.

--

From a statistical viewpoint:

--

- Distributional regression (Rügamer et al., 2020, 2021).

--

- Mixed modelling (Tran et al., 2020).

--

From a machine-learning viewpoint:

--

- Neural Additive Models (Agarwal et al., 2020).

---

# Feedforward Neural Networks

--

.pull-left[
<img src="data:image/png;base64,#img/FNN.png" width="90%" height="110%" style="display: block; margin: auto;" />
]

<br> <br>

???

FNN with single hidden layer. Neural networks are often represented diagrammatically, so here we have a visual representation of a feedforward neural network. They are made up of three different layers.

--

$$ `\begin{equation} \text{NN}(x_i) = \gamma_0+\sum_{k=1}^q \gamma_k \phi \left( \sum_{j=0}^p \omega_{jk}x_{ji}\right) \end{equation}` $$

???

Of course, we can also represent this neural network with an equation, which is given here. Explanation

It can be viewed as a non-linear regression model, where the weights are the parameters of the model, and again the input nodes represent covariates and the hidden layer controls complexity.
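---

# Feedforward Neural Networks: R Sketch

A minimal base-R sketch of the equation above, not the authors' implementation. The function name `fnn_predict`, the argument layout, and the logistic choice of activation function are assumptions made for illustration.

```r
# Hypothetical helper: forward pass of a single-hidden-layer FNN.
# X:     n x p matrix of covariates
# omega: (p + 1) x q matrix of input-to-hidden weights (row 1 is the bias, j = 0)
# gamma: vector of length q + 1 of hidden-to-output weights (gamma[1] is gamma_0)
# phi:   activation function (logistic assumed here)
fnn_predict <- function(X, omega, gamma,
                        phi = function(z) 1 / (1 + exp(-z))) {
  X1 <- cbind(1, X)          # prepend x_0 = 1 so the bias sits inside the sum over j
  H  <- phi(X1 %*% omega)    # n x q matrix of hidden-node outputs
  drop(gamma[1] + H %*% gamma[-1])
}

set.seed(1)
p <- 3; q <- 2; n <- 5
X     <- matrix(rnorm(n * p), n, p)
omega <- matrix(rnorm((p + 1) * q), p + 1, q)
gamma <- rnorm(q + 1)

pred      <- fnn_predict(X, omega, gamma)
n_weights <- length(c(omega, gamma))   # (p + 2) * q + 1 weights in total
```

Treating the bias as the `j = 0` input keeps the code in line with the sum from `j = 0` in the equation.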
---

# Motivating Example

--

### Boston Housing Data (Kaggle)

--

506 communities in Boston, MA.

--

Response:

- `medv` (median value of owner-occupied homes)

--

12 Explanatory Variables:

- `rm` (average number of rooms per dwelling)
- `lstat` (proportion of population that are disadvantaged)

---

# R Implementation: nnet

--

```r
library(nnet)
# `Boston` is assumed to be already loaded as a data frame
nn <- nnet(medv ~ ., data = Boston, size = 8,
           maxit = 5000, linout = TRUE)
summary(nn)
```

--

```{.bg-primary}
##   b->h1  i1->h1  i2->h1  i3->h1  i4->h1  i5->h1  i6->h1  i7->h1  i8->h1  i9->h1
##    2.79    5.92    0.34    1.31    0.23   -1.31   -2.67    0.77   -0.22    1.46
## i10->h1 i11->h1
##    1.20    1.26
##   b->h2  i1->h2  i2->h2  i3->h2  i4->h2  i5->h2  i6->h2  i7->h2  i8->h2  i9->h2
##   20.53    5.59    3.52   -0.64   12.64   -5.25   -4.12    0.24    2.64    0.49
## i10->h2 i11->h2
##  -21.17    4.03
## [...]
```

???

R packages: neuralnet, ann, nnet, keras. nnet has a history in R, and it is very easy to use. It is useful for prediction. However, it is not very insightful. Looking at the output of the summary for an nnet object, we just get a list of coefficients. This is unlike the output from other statistical models, which provides summary tables containing effects and p-values.

---
class: inverse middle center subsection

# Statistical Perspective

---

# Statistical Perspective

--

$$ y_i = \text{NN}(x_i) + \varepsilon_i, $$

--

where

$$ \varepsilon_i \sim N(0, \sigma^2) $$

<br>

--

$$ \ell(\theta)= -\frac{n}{2}\log(2\pi\sigma^2)-\frac{1}{2\sigma^2}\sum_{i=1}^n(y_i-\text{NN}(x_i))^2 $$

---

# Uncertainty Quantification

Then, as `\(n \to \infty\)`

$$ \hat{\theta} \sim N[\theta, \Sigma = \mathcal{I}(\theta)^{-1}] $$

???

Then, the asymptotic results from maximum likelihood theory apply, so as n goes to infinity, our estimated weight vector is normally distributed around the true weight vector theta, with variance-covariance matrix given by the inverse of the information matrix.

--

Estimate `\(\Sigma\)` using

$$ \hat{\Sigma} = I_o(\hat{\theta})^{-1} $$

???
Of course, we can estimate Sigma using the observed information matrix, which we can easily compute from a neural network optimiser such as nnet, and this can be used in any uncertainty quantification. So this allows us to perform hypothesis tests and calculate confidence intervals, which are not usually computed for neural networks.

--

<br>

However, inverting `\(I_o(\hat{\theta})\)` can be problematic in neural networks.

---

# Redundancy

--

Redundant hidden nodes can lead to issues of unidentifiability for some of the parameters (Fukumizu, 1996).

<br>

???

If we have a hidden node in our model which provides no contribution to the estimation of the response, this node is redundant. This then leads to an issue of unidentifiability for all the weights that enter that hidden node from the input layer.

--

Redundant hidden nodes `\(\implies\)` Singular information matrix.

<br>

--

Model selection is required.

???

So, while it is common in the implementation of these models to select the number of hidden nodes to be quite large to capture all non-linearities, when taking a statistical standpoint and wanting to quantify any uncertainty in your estimates (or functions thereof), model selection is required.

---
class: inverse middle center subsection

# Model Selection

---

# Model Selection

<img src="data:image/png;base64,#img/FNN-ms.png" width="65%" style="display: block; margin: auto;" />

---
count: false

# Model Selection

<img src="data:image/png;base64,#img/FNN-vs.png" width="65%" style="display: block; margin: auto;" />

---
count: false

# Model Selection

<img src="data:image/png;base64,#img/FNN-vsmc.png" width="65%" style="display: block; margin: auto;" />

---

# Proposed Approach

.pull-left[
<img src="data:image/png;base64,#img/FNN1.png" width="100%" style="display: block; margin: auto;" />
]

--

.pull-right[

Three phases for model selection:

{{content}}

]

--

1. Hidden-node selection

{{content}}

--

2. Input-node selection

{{content}}

--

3. Fine tuning

{{content}}

---

# Proposed Approach

--

.center[
<figcaption>Hidden Node Selection</figcaption>
<img src="data:image/png;base64,#img/hidden-node-2.png" height="125px"/>
]

--

.center[
<figcaption>Input Node Selection</figcaption>
<img src="data:image/png;base64,#img/input-node-2.png" height="125px"/>
]

--

.center[
<figcaption>Fine Tuning</figcaption>
<img src="data:image/png;base64,#img/fine-tune-2.png" height="125px"/>
]

---

# Objective Function

--

- Machine Learning:

--

$$ `\begin{equation} \text{Out-of-Sample MSE} = \frac{1}{n_\text{val}}\sum_{i=1}^{n_\text{val}} (y_i - \text{NN}(x_i))^2 \end{equation}` $$

--

- Proposed:

--

$$ `\begin{equation} \text{BIC} = -2\ell(\hat{\theta}) + \log(n)(K + 1), \end{equation}` $$

--

$$ `\begin{equation} K = (p+2)q+1 \end{equation}` $$

---

# Simulation Setup

.pull-left[

<br>

True Model: `\(p = 3\)`, `\(q = 3\)`

]

---
count: false

# Simulation Setup

.pull-left[

<br>

True Model: `\(p = 3\)`, `\(q = 3\)`

<br>

No. unimportant inputs: `\(10\)`

]

---
count: false

# Simulation Setup

.pull-left[

<br>

True Model: `\(p = 3\)`, `\(q = 3\)`

<br>

No. unimportant inputs: `\(10\)`

<br>

Max no.
hidden nodes: `\(10\)`

]

--

.pull-right[
<img src="data:image/png;base64,#img/simFNN.png" width="90%" style="display: block; margin: auto;" />
]

---

# Simulation Results: Approach

--

<img src="data:image/png;base64,#img/table-sim-approach.png" width="65%" style="display: block; margin: auto;" />

---

# Simulation Results: Objective Function

--

<img src="data:image/png;base64,#img/table-sim-objfun.png" width="50%" style="display: block; margin: auto;" />

--

<img src="data:image/png;base64,#img/table-sim-metrics.png" width="70%" style="display: block; margin: auto;" />

---
class: inverse middle center subsection

# Statistical Interpretation

---

# Hypothesis Testing

--

.pull-left[
<img src="data:image/png;base64,#img/FNN1.png" width="100%" style="display: block; margin: auto;" />
]

---
count: false

# Hypothesis Testing

.pull-left[
<img src="data:image/png;base64,#img/FNN2.png" width="100%" style="display: block; margin: auto;" />
]

--

.pull-right[

Wald test:

{{content}}

]

--

$$ `\begin{equation} \omega_j = (\omega_{j1},\omega_{j2},\dotsc,\omega_{jq})^T \end{equation}` $$

{{content}}

--

$$ `\begin{equation} H_0: \omega_j = 0 \end{equation}` $$

{{content}}

--

$$ `\begin{equation} (\hat{\omega}_{j} - \omega_j)^T\Sigma_{\hat{\omega}_{j}}^{-1}(\hat{\omega}_{j} - \omega_j) \sim \chi^2_q \end{equation}` $$

{{content}}

---

# Simple Covariate Effect

<br>

--

$$ `\begin{equation} \hat{\tau}_j = E[\text{NN}(X)|x_{(j)} > a] - E[\text{NN}(X)|x_{(j)} < a] \end{equation}` $$

<br>

--

Usually set `\(a = m_j\)`, where `\(m_j\)` is the median value of covariate `\(j\)`.

--

<br>

Associated uncertainty via delta method / bootstrapping.

---

# Covariate-Effect Plots

$$ `\begin{equation} \overline{\text{NN}}_j(x) = \frac{1}{n}\sum_{i=1}^n \text{NN}(x_{(i,1)}, \ldots,x_{(i,j-1)},x, x_{(i,j+1)}, \ldots, x_{(i,p)}) \end{equation}` $$

???
We define NN j bar of x, which is a conditional average neural network prediction over the data when all the covariates vary except for covariate j, which is fixed at x.

--

Propose covariate-effect plots of the following form:

--

$$ `\begin{equation} \hat{\beta}_j(x,d) = \overline{\text{NN}}_j(x + d) - \overline{\text{NN}}_j(x) \end{equation}` $$

--

Usually set `\(d = \text{SD}(x_j)\)`

--

Associated uncertainty via delta method.

---
class: inverse middle center subsection

# R Implementation

---

# R Implementation

--

.left-column[

<br>

<img src="data:image/png;base64,#img/statnnet.png" width="80%" style="display: block; margin: auto;" />

]

--

.right-column[

<br> <br>

```r
# install.packages("devtools")
library(devtools)
install_github("andrew-mcinerney/statnnet")
```

]

???

We have implemented these concepts in an R package, statnnet, which is currently available on GitHub, and will be on CRAN soon. This package can be used to perform the proposed model selection approach, and can also take an nnet object as an input and calculate a more informative summary with p-values and effects.

---

# Data Application (Revisited)

### Boston Housing Data (Kaggle)

506 communities in Boston, MA.

--

Response:

- `medv` (median value of owner-occupied homes)

--

12 Explanatory Variables:

- `rm` (average number of rooms per dwelling)
- `lstat` (proportion of population that are disadvantaged)

???

So, to show statnnet in action, I’m going to revisit the data application. As I said earlier, we have 506 observations of communities within Boston, the response is the median value of a house, and we are focusing on two of the twelve predictors: the average number of rooms, and a measure of disadvantage.
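---

# Covariate-Effect Plots: R Sketch

The averaged prediction and the covariate effect defined earlier can be sketched in base R as follows. This is an illustration under stated assumptions, not the statnnet implementation: `nn_bar()`, `cov_effect()` and the toy `pred_fun()` are hypothetical names, and `pred_fun()` stands in for the prediction function of a fitted network.

```r
# Hypothetical helper: average prediction with covariate j fixed at x
nn_bar <- function(pred_fun, X, j, x) {
  X[, j] <- x            # set covariate j to x for every observation
  mean(pred_fun(X))      # average over the observed values of the other covariates
}

# Hypothetical helper: effect of moving covariate j from x to x + d
cov_effect <- function(pred_fun, X, j, x, d = sd(X[, j])) {
  nn_bar(pred_fun, X, j, x + d) - nn_bar(pred_fun, X, j, x)
}

# Toy stand-in for a fitted network: nonlinear in column 1, linear in column 2
pred_fun <- function(X) X[, 1]^2 + 2 * X[, 2]

set.seed(1)
X <- matrix(rnorm(200), ncol = 2)

# Linear covariate: the effect is the same at every x
eff_linear <- cov_effect(pred_fun, X, j = 2, x = 0, d = 1)

# Nonlinear covariate: the effect changes with x
eff_at_0 <- cov_effect(pred_fun, X, j = 1, x = 0, d = 1)
eff_at_1 <- cov_effect(pred_fun, X, j = 1, x = 1, d = 1)
```

Evaluating `cov_effect()` over a grid of `x` values gives the curve shown in a covariate-effect plot; the uncertainty bands (delta method) are not covered by this sketch.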
---

# Boston Housing: Model Selection

```r
library(statnnet)
nn <- selectnn(medv ~ ., data = Boston, Q = 10,
               n_init = 10, maxit = 5000)
summary(nn)
```

--

```{.bg-primary}
## Call:
## selectnn.formula(formula = medv ~ ., data = Boston, Q = 10, n_init = 10,
##     maxit = 5000)
##
## Number of input nodes: 8
## Number of hidden nodes: 4
##
## Value: -976.5733
##
## Inputs:
##  Covariate Selected Delta.BIC
##         rm      Yes   236.907
##      lstat      Yes   168.023
##        dis      Yes   139.305
##        nox      Yes    95.203
##    ptratio      Yes    59.154
##      indus      Yes    38.201
##        rad      Yes    35.825
##       crim      Yes     8.960
##       chas       No   -19.769
##        age       No   -50.105
##         zn       No   -64.266
##
## [...]
```

???

We can perform model selection using a function called selectnn(). We supply a formula and data, along with Q, which is the maximum hidden layer size to be considered. Then, after running model selection, we can look at a summary. Here, I have a condensed output from summary. As you can see, the model selection procedure selected 8 of the 12 covariates and selected the number of hidden nodes to be 4. Then, just focusing on the two covariates, you can see both were selected, and we also report delta BIC as a measure of variable importance. This is calculated by removing the covariate, refitting the model, calculating the BIC and comparing it to that of the model with the covariate included.

---

# Boston Housing: Model Comparison

<img src="data:image/png;base64,#img/modelcom_boston-1.png" width="95%" style="display: block; margin: auto;" />

---

# Boston Housing: Model Comparison

<img src="data:image/png;base64,#img/modelcomp_boston_zoom-1.png" width="95%" style="display: block; margin: auto;" />

---

# Boston Housing: Model Summary

```r
stnn <- statnnet(nn)
summary(stnn)
```

--

```{.bg-primary}
## [...]
## Coefficients:
##                         Wald
##          Estimate Std. Error |      X^2 Pr(> X^2)
## crim    -0.115769   0.019085 | 109.8369  0.00e+00 ***
## indus   -0.176500   0.018028 |  51.6302  1.65e-10 ***
## nox     -0.163091   0.020639 |  39.4919  5.51e-08 ***
## rm       0.201211   0.017924 |  45.5051  3.12e-09 ***
## dis      0.101701   0.022437 |  14.6031  5.60e-03 **
## rad     -0.099667   0.019687 | 107.3354  0.00e+00 ***
## ptratio -0.192649   0.016672 |   7.8733  9.63e-02 .
## lstat   -0.263402   0.014443 |  50.2500  3.20e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

???

We also have a statnnet function, which can take a neural network as input and calculate a more statistically-focused summary table. Here, we have a summary table for our neural network. We have point estimates and their standard errors. I didn’t touch on how we calculate these in this talk, but they are calculated by splitting the covariate into two groups based on the median value and looking at the difference in the average prediction of both groups. We also get the Wald test results. So we can see here that rm and lstat are both statistically significant.

---

# Boston Housing: Simple Effects

<img src="data:image/png;base64,#img/BostonEffects1-1.png" width="90%" style="display: block; margin: auto;" />

---

# Boston Housing: Covariate-Effect Plots

```r
plot(stnn, conf_int = TRUE, method = "deltamethod", which = c(4, 8))
```

--

.pull-left[
<!-- -->
]

--

.pull-right[
<!-- -->
]

???

And, we can also calculate the covariate-effect plots I mentioned earlier, and their associated uncertainty. Here is the code to plot rm and lstat, which are the 4th and 8th covariates, and their uncertainty, which is estimated using the delta method. Here is the rm plot, so you can look at these plots as varying-effect plots. So we can see the effect of average number of rooms on median house price is zero when the houses are quite small, so about 3.5 rooms. But, as the number of rooms increases, the effect gets stronger, and it is positive so it increases the median house value.
Then maybe it looks like when we get to an average number of rooms of just over six, the effect seems to become constant at about 0.12. For the lstat plot, we can see that when lstat is low, an increase in lstat is associated with a negative effect of about -0.1, but as lstat increases the effect becomes weaker, and then, for a value of about 29%, increasing lstat has no effect on the value of homes.

---

# Summary

Feedforward neural networks are non-linear regression models.

--

Calculation of a likelihood function allows for uncertainty quantification.

--

Statistically-based model selection is required to avoid issues of unidentifiability.

--

Our R package extends existing neural network packages to allow for model selection and a more interpretable, statistically-based output.

---

# References

<font size="4"> R. Agarwal, N. Frosst, X. Zhang, R. Caruana and G. E. Hinton, "Neural additive models: Interpretable machine learning with neural nets," arXiv preprint arXiv:2004.13912, 2020. </font>

<br>

<font size="4"> K. Fukumizu, "A regularity condition of the information matrix of a multilayer perceptron network," Neural Networks, vol. 9, no. 5, pp. 871–879, 1996. </font>

<br>

<font size="4"> D. Rügamer, C. Kolb and N. Klein, "Semi-Structured Deep Distributional Regression: Combining Structured Additive Models and Deep Learning," arXiv preprint arXiv:2002.05777, 2020. </font>

<br>

<font size="4"> D. Rügamer, C. Kolb, C. Fritz, F. Pfisterer, B. Bischl, R. Shen, C. Bukas, L. Barros de Andrade e Sousa, D. Thalmeier, P. Baumann, N. Klein and C. L. Müller, "deepregression: a Flexible Neural Network Framework for Semi-Structured Deep Distributional Regression," arXiv preprint arXiv:2104.02705, 2021. </font>

<br>

<font size="4"> M.-N. Tran, N. Nguyen, D. Nott and R. Kohn, "Bayesian deep net GLM and GLMM," Journal of Computational and Graphical Statistics, vol. 29, pp. 97–113, 2020. </font>

---
class: final-slide

# References

McInerney, A. and Burke, K. (2022). A Statistically-Based Approach to Feedforward Neural Network Model Selection. arXiv preprint arXiv:2207.04248.

### R Package

```r
devtools::install_github("andrew-mcinerney/statnnet")
```

<br>
<font size="5">andrew-mcinerney</font>
<font size="5">@amcinerney_</font>
<font size="5">andrew.mcinerney@ul.ie</font>